Near Chromosome-Level Genome Assembly and Annotation of Rhodotorula babjevae Strains Reveals High Intraspecific Divergence

The genus Rhodotorula includes basidiomycetous oleaginous yeast species. Rhodotorula babjevae can produce compounds of biotechnological interest such as lipids, carotenoids, and biosurfactants from low value substrates such as lignocellulose hydrolysate. High-quality genome assemblies are needed to develop genetic tools and to understand fungal evolution and genetics. Here, we combined short- and long-read sequencing to resolve the genomes of two R. babjevae strains, CBS 7808 (type strain) and DBVPG 8058, at chromosomal level. Both genomes are 21 Mbp in size and have a GC content of 68.2%. Allele frequency analysis indicates that both strains are tetraploid. The genomes consist of a maximum of 21 chromosomes with a size of 0.4 to 2.4 Mbp. In both assemblies, the mitochondrial genome was recovered in a single contig, that shared 97% pairwise identity. Pairwise identity between most chromosomes ranges from 82 to 87%. We also found indications for strain-specific extrachromosomal endogenous DNA. A total of 7591 and 7481 protein-coding genes were annotated in CBS 7808 and DBVPG 8058, respectively. CBS 7808 accumulated a higher number of tandem duplications than DBVPG 8058. We identified large translocation events between putative chromosomes. Genome divergence values between the two strains indicate that they may belong to different species.


Introduction
Oleaginous yeasts have received considerable attention in recent years due to many potential biotechnological applications of microbial lipids. Rhodotorula species are basidiomycetous oleaginous yeasts whose lipid production accounts for more than 70% of dry cell weight. They show high tolerance to inhibitors, enabling them to convert lignocellulosic hydrolysates into lipids [1][2][3][4]. Microbial lipids from R. babjevae and other oleaginous yeasts have a fatty acid composition similar to vegetable oils and represent an environmentally and ethically suitable alternative raw material for the production of biofuels, oleochemicals, feed, and food additives [2,5,6]. Under nitrogen-limited conditions, R. babjevae can simultaneously accumulate biotechnologically important enzymes, glycolipids, and carotenoids [5]. Glycolipids from R. babjevae have promising environmental applications in biodegrading hydrocarbon pollutants and replacing synthetic compounds and chemical surfactants [7][8][9]. They are also attractive for other applications in various industrial sectors due to their

Library Preparation and Sequencing
The extracted DNA samples were sequenced using MinION (Oxford Nanopore Technologies, Oxford, UK) and the Illumina sequencing platform. Nanopore DNA libraries were prepared according to [23]. Briefly, 31.5 µL of AMPure magnetic beads were added to 5 µg of DNA for a "pre-cleaning" step. Library preparation was then performed according to a modified protocol [23] using a Ligation Sequencing Kit (SQK-LSK109, Oxford Nanopore Technologies, Oxford, UK). Each DNA library was loaded onto a FLO-MIN106 flow cell mounted on a MINION device (Oxford Nanopore Technologies). MinKNOW software (version 19.06.8) was used for sequencing as described by [23]. The base calling was run using Guppy version 3.2.4-1-195590e and model HAC-mod (modified base sensitive high accuracy model).
From the 6,665,174 long reads recovered from the CBS 7808 DNA library, the mean read length was 2789.7 bases and the read length N50 5553 bases yielding a total of 18,593 Mbp. For DBVPG 8058, 2,953,255 long reads were retrieved containing a total of 15,702 Mbp.
The mean read length was 5317 bases and the read length N50 7411 bases. Aliquots of the extracted DNA from both R. babjevae strains were also subjected to short-read paired-end sequencing using the Illumina Novaseq platform (S prime, 2× 150 bp) and the TruSeq PCR free DNA library preparation kit (Illumina Inc., San Diego, CA, USA). 179,163,622 short reads were recovered from CBS 7808 DNA library, corresponding to a total of 27,053 Mbp. For DBVPG 8058, 203,873,550 short reads were retrieved containing a total of 30,784 Mbp.

Genome Assembly and Annotation
Genome assembly and annotation was performed using a custom pipeline described elsewhere [3], applying the program versions listed in Table S1. To further improve the annotation of transcripts and exon-intron boundaries, we additionally mapped RNA-Seq data from the closely related R. toruloides CBS 14 (PRJEB40807) to the R. babjevae genomes as previously described [3]. We used nQuire (v0.0) based on minimap2 short-read mappings (v2.17; no secondary alignments option) and the KmerCountExact script from the BBMap package (https://sourceforge.net/projects/bbmap/; accessed on 25 November 2021) (v38.86) to estimate the ploidy level of the R. babjevae strains [24,25]. To compare these methods and our ploidy results of the two R. babjevae strains with already published results, we also performed nQuire and KmerCountExact on Illumina sequencing data from Rhodotorula mucilaginosa JGTA-S1, accession number SRR5821556 [12].
The reconstruction of lipid metabolic pathway maps was performed using KEGG Mapper version 4.3. The KEGG Orthology (KO) identifiers were affiliated to the annotated transcripts of R. babjevae CBS 7808 and R. babjevae DBVPG 8058 using KofamKOALA [26] with an e-value cut-off of 0.01.

Genome Divergence Analysis
Synteny relationship analysis between R. babjevae CBS 7808 and R. babjevae DBVPG 8058 was performed using NUCmer (MUMmer, version 3.23). The maximum gap between adjacent matches in a cluster were set to 500 and the minimum cluster length to 100. Visualization of NUCmer alignments and other genomic features was performed with Circa (http://omgenomics.com/circa; accessed on 23 June 2021).
Whole genome comparison and identification of orthologous gene clusters and paralogous genes were performed on the web-based OrthoVenn2 platform (https://orthovenn2 .bioinfotoolkits.net; accessed on 20 October 2021) using a threshold e-value of 1 × 10 −15 and an inflation of 1.5 [33]. To identify duplicated genes (paralogs) with high sequence identity, an all-against-all sequence identity search was performed on the NCBI Genome Workbench version 3.7.0 [34] using BLASTp (BLOSUM62 matrix) with a cut-off e-value of 1 × 10 −15 . The output file was screened for protein sequences with at least 70% coverage and 70% sequence identity.

Genome Assembly, Ploidy Estimation, and Gene Annotation of R. babjevae Strains
The genome of both R. babjevae strains was assembled by a combined approach of long-and short-read sequencing with a coverage depth of about 2000 X. A summary of the genomic data is presented in Table 1. The CBS 7808 draft genome has an overall size of 21,862,387 bp and a GC content of 68.23%. Repetitive sequences make up 5.93% of the total length of the genome, of which 4.98% are single repeats and 0.96% are regions of low complexity. The draft genome of DBVPG 8058 has a total size of 21,522,072 bp and a GC content of 68.24%. The approach identified 6.73% as repetitive sequences, including 5.65% as single repeats and 1.09% as regions of low complexity. The similarity of genome features, such as genome size, GC content, and percentage of repetitive regions, confirms that they are closely related species. The genome size is comparable to that of other Rhodotorula species, but the GC content is slightly higher [3,[11][12][13]15,18] (Table 1). Sequence assembly resulted for R. babjevae CBS 7808 in 24 contigs and three scaffolds with a length N50 of 1,067,634 bp ( Figure 1a, Table S2). A telomeric region was predicted at one of the termini for 13 contigs and scaffolds larger than 250,000 bp. The draft genome of strain DBVPG 8058 consists of 33 contigs and one scaffold with a length N50 of 789,767 bp ( Figure 1b, Table S3). From the contigs and scaffolds with sizes larger than 250,000 bp in DBVPG 8058 genome assembly, two have telomere sequences at both termini and 15 at one terminus each. The low numbers of contigs and scaffolds in the genome assemblies from both R. babjevae strains indicate high accuracy, contiguity, and completeness. Two putative circular sequences were identified in each strain. Among them, contig_2 in CBS 7808 and contig_79 in DBVPG 8058 contained the mitochondrial genes. Both mitochondrial genomes are similar in size with 30.876 bp and 28.432 bp, respectively, and have a GC content of 38.9% (Tables S2 and S3). mitochondrial genomes are similar in size with 30.876 bp and 28.432 bp, respectively, and have a GC content of 38.9% (Tables S2 and S3).  To estimate the ploidy in R. babjevae strains, we used nQuire. nQuire quantifies the distribution of the base frequencies at variable sites, and thus differentiates between different degrees of ploidy [24]. In both strains, the alleles occurred at frequencies of about 25% and 75%, indicating that both R. babjevae strains are tetraploid ( Figure 2). Furthermore, we also used a k-mer counting approach to estimate ploidy. Using a k-mer length of 31, as recently shown by Sen et al. [12] for R. mucilaginosa JGTA-S1, only one peak FA S1 To estimate the ploidy in R. babjevae strains, we used nQuire. nQuire quantifies the distribution of the base frequencies at variable sites, and thus differentiates between different degrees of ploidy [24]. In both strains, the alleles occurred at frequencies of about 25% and 75%, indicating that both R. babjevae strains are tetraploid ( Figure 2). Furthermore, we also used a k-mer counting approach to estimate ploidy. Using a k-mer length of 31, as recently shown by Sen et al. [12] for R. mucilaginosa JGTA-S1, only one peak appears in the plots. However, when the k-mer length is reduced to 17, as recently shown by Zou et al. [35], two distinct peaks appear for both R. mucilaginosa JGTA-S1 and the R. babjevae strains ( Figure 2). The first and larger peak indicates tetraploidy while the second smaller peak indicates diploidy. appears in the plots. However, when the k-mer length is reduced to 17, as recently shown by Zou et al. [35], two distinct peaks appear for both R. mucilaginosa JGTA-S1 and the R. babjevae strains ( Figure 2). The first and larger peak indicates tetraploidy while the second smaller peak indicates diploidy. The ploidy level of R. babjevae strains has not been studied so far. The genomes of the closely related strains R. toruloides NP11 and R. mucilaginosa JGTA-S1 are considered to be haploid [12,15]. However, our analyses indicate that both R. mucilaginosa JGTA-S1 and R. babjevae CBS 7808 and DBVPG 8058 may be tetraploid. Tetraploidy has previously been widely recognized in yeast [36][37][38][39]. Knowing the ploidy level is of great importance for genetic engineering and for the development of efficient gene manipulation protocols.
A total of 7591 protein-coding genes and 7607 associated transcripts were annotated in the CBS 7808 genome using MetaEuk ( Table 1). The average number of estimated exons per gene is 3.97 (Table 1). The genome of DBVPG 8058 has 7481 protein-coding genes, 7516 associated transcripts and 3.93 estimated exons per gene ( Table 1). The proportion of split genes in both genomes is correspondingly high, amounting to 6390 and 6305 for CBS 7808 and DBVPG 8058, respectively. This is consistent with previous findings for Rhodotorula spp. [3,12,15]. The distribution of exon counts in the genomes of R. babjevae strains CBS 7808 and DBVPG 8058 is shown in Table S4. 315 and 309 open reading frames (ORF) complementary to annotated genes were predicted in CBS 7808 and DBVPG 8058, respectively. The presence of antisense transcripts has previously been reported for the related species R. toruloides [3]. In yeast, the level of antisense transcription has been anti-correlated to sense mRNA, indicating antisense-dependent gene regulation through transcription interference under certain growth conditions [40,41]. Figures S1-S3 show the The ploidy level of R. babjevae strains has not been studied so far. The genomes of the closely related strains R. toruloides NP11 and R. mucilaginosa JGTA-S1 are considered to be haploid [12,15]. However, our analyses indicate that both R. mucilaginosa JGTA-S1 and R. babjevae CBS 7808 and DBVPG 8058 may be tetraploid. Tetraploidy has previously been widely recognized in yeast [36][37][38][39]. Knowing the ploidy level is of great importance for genetic engineering and for the development of efficient gene manipulation protocols.
A total of 7591 protein-coding genes and 7607 associated transcripts were annotated in the CBS 7808 genome using MetaEuk ( Table 1). The average number of estimated exons per gene is 3.97 (Table 1). The genome of DBVPG 8058 has 7481 protein-coding genes, 7516 associated transcripts and 3.93 estimated exons per gene ( Table 1). The proportion of split genes in both genomes is correspondingly high, amounting to 6390 and 6305 for CBS 7808 and DBVPG 8058, respectively. This is consistent with previous findings for Rhodotorula spp. [3,12,15]. The distribution of exon counts in the genomes of R. babjevae strains CBS 7808 and DBVPG 8058 is shown in Table S4. 315 and 309 open reading frames (ORF) complementary to annotated genes were predicted in CBS 7808 and DBVPG 8058, respectively. The presence of antisense transcripts has previously been reported for the related species R. toruloides [3]. In yeast, the level of antisense transcription has been anti-correlated to sense mRNA, indicating antisense-dependent gene regulation through transcription interference under certain growth conditions [40,41]. Figures S1-S3 show the assignment of genes to the Gene Ontology (GO) categories' biological processes, cellular components, and molecular functions, of which the top 10 are summarized in Figure 3a,b.
Benchmarking of universal single-copy orthologs (BUSCOs, using fungi_odb9) identified that 95.5% and 96.9% of the assessed genes in CBS 7808 and DBVPG 8058, respectively, were complete and single-copy ( Figure S6). This supports the high quality of the draft genome assemblies reported here. Furthermore, 0.7% and 0.3% of the assessed genes in CBS 7808 and DBVPG 8058, respectively, were fragmented and the rest were missing ( Figure S6). A small percentage of BUSCO genes might still be undetectable due to sequence regions with low coverage, repetitive elements, or assembly problems that cannot be solved even with the hybrid approach and would require additional sequencing and manual analysis. In addition, when a BUSCO gene was missing, there were either no significant matches or the BUSCO matches were below the range of values for the selected BUSCO profile. Finally, some marker genes that are part of the BUSCO "fungi" profile that we used as reference may not be part of the two R. babjevae strains.
assignment of genes to the Gene Ontology (GO) categories' biological processes, cellular components, and molecular functions, of which the top 10 are summarized in Figure 3a,b. A total of 2691 and 2660 CDS from CBS 7808 and DBVPG 8058, respectively, could be assigned KO numbers (Figure 3c). The biosynthesis of saturated and unsaturated fatty acids, glycerolipid metabolism, terpenoid backbone biosynthesis, carbon metabolism, and fatty acid metabolism are depicted in detail in Figures S4 and S5. Some examples of annotated genes that encode crucial enzymes for lipid and carotenoid metabolism are CDC19 ,  MAE1, MAE2, ACL1, ACL2, ACC1, FAS1, FAS2, OLE1, ACAD10, ACAD11, IBR3,  D6C81_05617, POT1, LRO1, HMG1, HCS1, ERG8, crtYB, crtI, and BTS1 (Tables S5 and S6). A difference in this respect is the absence of ACL2, and the presence of ACAD10 in DBVPG 8058.
Benchmarking of universal single-copy orthologs (BUSCOs, using fungi_odb9) identified that 95.5% and 96.9% of the assessed genes in CBS 7808 and DBVPG 8058, respectively, were complete and single-copy ( Figure S6). This supports the high quality of the draft genome assemblies reported here. Furthermore, 0.7% and 0.3% of the assessed genes in CBS 7808 and DBVPG 8058, respectively, were fragmented and the rest were missing ( Figure S6). A small percentage of BUSCO genes might still be undetectable due to sequence regions with low coverage, repetitive elements, or assembly problems that cannot be solved even with the hybrid approach and would require additional sequencing and manual analysis. In addition, when a BUSCO gene was missing, there were either no significant matches or the BUSCO matches were below the range of values for the selected BUSCO profile. Finally, some marker genes that are part of the BUSCO "fungi" profile that we used as reference may not be part of the two R. babjevae strains.

Chromosome Organization
The R. babjevae genome assemblies were aligned for comparison using NUCmer. Out of a total of 27 contigs and scaffolds in CBS 7808, 24 matched 30 of the 34 assembled sequences in DBVPG 8058 (Figure 4). In general, the number of undisturbed segments is high. However, there are also major chromosomal rearrangements (Figure 4). LASTZ alignments of each contig from one R. babjevae strain with the whole genome of the other strain confirmed the results of the synteny analysis (Table S7, Figures S7 and S8). Based on these alignments we deduce that R. babjevae has a maximum of 21 chromosomes with sizes ranging from 0.4 to 2.4 Mbp (Table 2, Figure 4). The molecular karyotype of several Saccharomyces yeast strains has been identified as 16 [42]. Karyotyping studies in Rhodotorula species have identified at least 10 chromosomes in isolates of R. mucilaginosa and 11 in R. toruloides while Martín-Hernández et al. proposed that R. toruloides CBS 14 has at least 18 chromosomes [3,43,44]. The pairwise identity between chromosomes ranges from 82% to 87%. The mitochondrial genomes have 97% pairwise identity (Table S7, Figure S7). Four of the putative chromosomes are affected by large translocation events. This affects chromosomes 3 and 6, and chromosomes 9 and 14 ( Table 2). Minor inversions were noticed in other chromosomes (Table S7, Figure S8). Each R. babjevae strain contains two contigs that are strain-specific (Tables S7 and S8). These are small linear contigs with higher read depths than the chromosomes, except for circular contig_26 in CBS 7808, which has a lower read depth than the chromosomes. These variations in read depth may indicate relaxed replication regulation. The linear DNA sequence from CBS 7808 con-tig_46 has two annotated genes, one of which encodes the Retrovirus-related Pol polyprotein from transposon 17.6. DNA plasmids have previously been found in filamentous fungi, including the close relative R. toruloides, with sizes ranging from 2.5 to 11 kb and typically encoding enzymes involved in plasmid replication [3,45,46]. This might indicate the presence of strain-specific extrachromosomal endogenous DNA.

Chromosome Organization
The R. babjevae genome assemblies were aligned for comparison using NUCmer. Out of a total of 27 contigs and scaffolds in CBS 7808, 24 matched 30 of the 34 assembled sequences in DBVPG 8058 ( Figure 4). In general, the number of undisturbed segments is high. However, there are also major chromosomal rearrangements (Figure 4). LASTZ alignments of each contig from one R. babjevae strain with the whole genome of the other strain confirmed the results of the synteny analysis (Table S7, Figures S7 and S8). Based on these alignments we deduce that R. babjevae has a maximum of 21 chromosomes with sizes ranging from 0.4 to 2.4 Mbp (Table 2, Figure 4). The molecular karyotype of several Saccharomyces yeast strains has been identified as 16 [42]. Karyotyping studies in Rhodotorula species have identified at least 10 chromosomes in isolates of R. mucilaginosa and 11 in R. toruloides while Martín-Hernández et al. proposed that R. toruloides CBS 14 has at least 18 chromosomes [3,43,44]. The pairwise identity between chromosomes ranges from 82% to 87%. The mitochondrial genomes have 97% pairwise identity (Table S7, Figure S7). Four of the putative chromosomes are affected by large translocation events. This affects chromosomes 3 and 6, and chromosomes 9 and 14 ( Table 2). Minor inversions were noticed in other chromosomes (Table S7, Figure S8). Each R. babjevae strain contains two contigs that are strain-specific (Tables S7 and S8). These are small linear contigs with higher read depths than the chromosomes, except for circular contig_26 in CBS 7808, which has a lower read depth than the chromosomes. These variations in read depth may indicate relaxed replication regulation. The linear DNA sequence from CBS 7808 contig_46 has two annotated genes, one of which encodes the Retrovirus-related Pol polyprotein from transposon 17.6. DNA plasmids have previously been found in filamentous fungi, including the close relative R. toruloides, with sizes ranging from 2.5 to 11 kb and typically encoding enzymes involved in plasmid replication [3,45,46]. This might indicate the presence of strain-specific extrachromosomal endogenous DNA.   Figure S8).    Figure S8).    Chr., chromosome.

Genome Divergence Analysis
The genomes of the R. babjevae strains were compared to each other and to genomes of closely related Rhodotorula species in terms of DDH, ANI and K r for tracing genome divergence ( Figure 5, Table S9). The R. babjevae strains share 44.20% DDH estimates, 84.48% ANI and K r values of 0.09. In general, the genetic divergence between R. babjevae strains was comparable to the divergence with R. graminis and R. glutinis, but higher than expected for strains of the same yeast species [47]. For instance, the divergence between strains of R. toruloides was much lower than that of the two R. babjevae strains (Table S9). Moreover, the protein-coding sequences of the R. babjevae strains and their closest relatives R. graminis and R. glutinis were analyzed using OrthoVenn2 web platform to identify and compare orthologous gene clusters. The R. babjevae species share 6598 out of a total of 7223 orthologous clusters produced by OrthoVenn2, including both single-copy gene clusters and overlapping gene clusters such as paralogs ( Figure 6). Of the shared clusters, 5933 are common within the three Rhodotorula species assessed, representing putative shared orthologous proteins that evolved from common ancestral genes. In addition, CBS 7808 has 389 single genes and one cluster that had no orthologs in the other  Moreover, the protein-coding sequences of the R. babjevae strains and their closest relatives R. graminis and R. glutinis were analyzed using OrthoVenn2 web platform to identify and compare orthologous gene clusters. The R. babjevae species share 6598 out of a total of 7223 orthologous clusters produced by OrthoVenn2, including both single-copy gene clusters and overlapping gene clusters such as paralogs ( Figure 6). Of the shared clusters, 5933 are common within the three Rhodotorula species assessed, representing putative shared orthologous proteins that evolved from common ancestral genes. In addition, CBS 7808 has 389 single genes and one cluster that had no orthologs in the other genomes, while strain DBVPG 8058 has 355 single genes. These unique genes could account for the specific functional capabilities of the R. babjevae strains as a result of gene loss or gain events. Of the 79 orthologous clusters shared only between R. babjevae strains, some of the assigned GO terms are: Positive regulation of the unsaturated fatty acid biosynthetic process by positive regulation of transcription from RNA polymerase II promoter (GO:0036083), protein O-linked glycosylation (GO:0006493), glucan catabolic process (GO:0009251), cellular calcium ion homeostasis (GO:0006874), sulfate assimilation (GO:0000103), and carbohydrate transport (GO:0008643). The two R. babjevae strains show a high genome pairwise similarity and a high number of shared orthologous clusters, though not as high as for R. graminis and R. glutinis ( Figure 6). In general, R. babjevae, R. glutinis, and R. graminis are very closely related species with a short evolutionary distance between them as compared to other species in the genus (i.e., R. toruloides). The strains CBS 7808 and DBVPG 8058 have high interstrain variability and a greater evolutionary distance to R. graminis than to R. glutinis. genomes, while strain DBVPG 8058 has 355 single genes. These unique genes could account for the specific functional capabilities of the R. babjevae strains as a result of gene loss or gain events. Of the 79 orthologous clusters shared only between R. babjevae strains, some of the assigned GO terms are: Positive regulation of the unsaturated fatty acid biosynthetic process by positive regulation of transcription from RNA polymerase II promoter (GO:0036083), protein O-linked glycosylation (GO:0006493), glucan catabolic process (GO:0009251), cellular calcium ion homeostasis (GO:0006874), sulfate assimilation (GO:0000103), and carbohydrate transport (GO:0008643). The two R. babjevae strains show a high genome pairwise similarity and a high number of shared orthologous clusters, though not as high as for R. graminis and R. glutinis ( Figure 6). In general, R. babjevae, R. glutinis, and R. graminis are very closely related species with a short evolutionary distance between them as compared to other species in the genus (i.e., R. toruloides). The strains CBS 7808 and DBVPG 8058 have high interstrain variability and a greater evolutionary distance to R. graminis than to R. glutinis. A total of 59 and 30 paralogous gene clusters were identified in CBS 7808 and DBVPG 8058, respectively, using OrthoVenn2 (Tables S10 and S11). Applying a cut-off value of 70% sequence coverage to them, we identified 29 and 19 duplicated genes, respectively, that potentially have not diverged in function. On the other hand, an all-against-all protein sequence similarity search was performed in each of the two strains using BLASTp with an e-value of 1 × 10 −15 , 70% coverage, and 70% sequence identity. This resulted in a total of 34 and 21 duplicated sequences in CBS 7808 and DBVPG 8058, respectively, and a total of 41 and 29 duplicated sequences with 70% sequence coverage, respectively, that were identified by any of the tools (Figure 1, Tables S12-S14). The higher accumulation of duplicated genes in CBS 7808 could be related to a higher number of gene duplication A total of 59 and 30 paralogous gene clusters were identified in CBS 7808 and DBVPG 8058, respectively, using OrthoVenn2 (Tables S10 and S11). Applying a cut-off value of 70% sequence coverage to them, we identified 29 and 19 duplicated genes, respectively, that potentially have not diverged in function. On the other hand, an all-against-all protein sequence similarity search was performed in each of the two strains using BLASTp with an e-value of 1 × 10 −15 , 70% coverage, and 70% sequence identity. This resulted in a total of 34 and 21 duplicated sequences in CBS 7808 and DBVPG 8058, respectively, and a total of 41 and 29 duplicated sequences with 70% sequence coverage, respectively, that were identified by any of the tools (Figure 1, Tables S12-S14). The higher accumulation of duplicated genes in CBS 7808 could be related to a higher number of gene duplication events due to faster evolution of the strain. The majority of these duplications lies adjacent to each other or in close proximity. Tandem duplications have been suggested as a mechanism of adaptative evolution to changing environments [48]. They may have arisen through homologous recombination between sequences on sister chromatids or homologous chromosomes [48]. It has been reported that considerable redundancy of duplicate gene pairs persists even after 100 million years of evolution in Saccharomyces cerevisiae [49]. Some of the predicted functions of the genes, which are duplicated only in CBS 7808, are Uncharacterized protein C17G8.02 (NAD biosynthesis), Mannose-6-phosphate isomerase and Phosphoenolpyruvate carboxykinase (ATP) (carbon metabolism), Acetyl-CoA carboxylase (fatty acid metabolism), Alpha-ketoglutarate-dependent sulfonate dioxygenase, Sulfite reductase [NADPH] hemoprotein beta-component and Sulfite reductase [NADPH] subunit beta (Sulfur metabolism), and Probable quinate permease (import of quinic acid as a carbon source). Some of the duplicated genes involved in metabolic processes identified only in DBVPG 8058 are mitochondrial Aspartate aminotransferase (intracellular NAD(H) redox balance) and Leucine-rich repeat extensin-like protein 3 (AtLRX3, cell morphogenesis). In both strains, the most frequently duplicated gene is SRRM2, which codes for the Ser/Arg repetitive matrix protein 2 and is involved in mRNA splicing. Cwc21p is encoded by CWC21, an ortholog of human SRRM2 in S. cerevisiae. It has been proposed that it resides in the catalytic center of the spliceosome and possibly fulfills its role in response to changing cellular environmental conditions [50]. The predicted function Ser/Arg repetitive matrix protein 2 was annotated in 1055 genes in CBS 7808 and 1068 in DBVPG 8058. Alternative splicing is an essential driver of proteomic diversity and may potentially provide a high level of evolutionary plasticity.
The type strain CBS7808 of R. babjevae investigated here, was first isolated from herbaceous plants in Moscow, Russia [51]. R. babjevae DBVPG 8058 was isolated from wild apples in Uppland locality, Sweden. The phylogenetic placement of DBVPG 8058 to the R. babjevae species was performed by the Industrial Yeasts Collection DBVPG by aligning 5.8S-ITS rDNA and D1/D2 26S rDNA regions in a similar manner as illustrated in Figure 5. However, the genome divergence values (DDH, ANI and K r ) proved to be more sensitive for delineating Rhodotorula species. Phylogenetic placement based on the standard rDNA regions may not be sufficient to understand yeast diversity and species delineation, as shown before [47,52,53]. These R. babjevae strains showed different behavior during enzymatic cell wall degradation for DNA purification both in this study and in another study where xylose medium was used [54]. Highly dynamic genome structures have already been found in closely related yeast species [20,[55][56][57][58][59]. A dynamic genome structure of R. babjevae could enhance the physiological capabilities and thus the species' environmental adaptability. [60][61][62][63]. However, their genetic divergence suggests that they may belong to different species. A genome comparison study using whole genome sequences from different strains of closely related Rhodotorula species would allow gaining a deeper knowledge about their genome structure and evolution, as well as identifying new species.
Taxonomic classification using Sourmash [64] and the GenBank reference (https: //osf.io/4f8n3; accessed on 17 March 2022) assign to both genome assemblies: Eukaryota superkingdom, Basidiomycota phylum, Microbotryomycetes class, Sporidiobolales order, Sporidiobolaceae family, Rhodotorula genus, and Rhodotorula graminis species. The taxonomic classification might indicate that R. graminis was the closest relative of R. babjevae with available genomic data. Previous studies have shown a close evolutionary relationship between R. babjevae and R. graminis, which was also demonstrated here [13,65].

Conclusions
The hybrid sequencing approach resulted in high-resolution genomes of R. babjevae DBVPG 8058 and CBS 7808 T . Both strains are tetraploid and have a maximum of 21 chromosomes. Some of the chromosomes show large-scale translocation events. Moreover, we demonstrated a high genome divergence between the R. babjevae strains, as high as the divergence to other closely related Rhodotorula species. This indicates that the two strains do not belong to the same species.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/jof8040323/s1, Figure S1: Gene Ontology (GO) term summary related to the GO topic: molecular functions. Figure S2: Gene Ontology (GO) term summaries belonging to the GO topic: biological processes. Figure S3: Gene Ontology (GO) term summaries belonging to the GO topic: cellular components. Figure S4: Examples of lipid metabolism pathways in Rhodotorula babjevae CBS 7808 reconstructed by KEGG Mapper. Figure S5: Examples of lipid metabolism pathways in Rhodotorula babjevae DBVPG 8058 reconstructed by KEGG Mapper. Figure S6: Quantitative assessment of the hybrid genome assemblies and annotation completeness using Benchmarking Universal Single-Copy Orthologs (BUSCO). Figure S7: LASTZ alignment of the mitochondrial genome sequences of Rhodotorula babjevae CBS 7808 and DBVPG 8058. Figure S8: LASTZ alignment of contigs with assigned homology of Rhodotorula babjevae CBS 7808 and DBVPG 8058 representing putative chromosomes. Table S1: Program versions used for the genome assembly and annotation pipeline. Table S2: Characteristics from the contigs and scaffolds of Rhodotorula babjevae CBS 7808 genome assembly. Table S3: Characteristics from the contigs and scaffolds of Rhodotorula babjevae DBVPG 8058 genome assembly. Table S4: Distribution of exon counts in the two strains of Rhodotorula babjevae. Table S5: Examples of lipid and carotenoid metabolism related genes in Rhodotorula babjevae CBS 7808. Table S6: Examples of lipid and carotenoid metabolism related genes in Rhodotorula babjevae DBVPG 8058. Table S7: Scaffold and contigs with assigned homology and pairwise nucleotide identity between Rhodotorula babjevae strains. Table S8: Summary of features of strain-unique contigs in Rhodotorula babjevae. Table  S9: Genetic divergence between Rhodotorula babjevae strains and closely related Rhodotorula species. Table S10: Alignment statistics of the duplicated genes identified in the Rhodotorula babjevae CBS 7808 genome by OrthoVenn2. Table S11: Alignment statistics of the duplicated genes identified in the Rhodotorula babjevae DBVPG 8058 genome by OrthoVenn2. Table S12: Alignment statistics of the duplicated genes identified in the Rhodotorula babjevae CBS 7808 genome by BLASTp. Table S13: Alignment statistics of the duplicated genes identified in the Rhodotorula babjevae DBVPG 8058 genome by BLASTp. Table S14: Duplicated genes in Rhodotorula babjevae identified by BLASTp and OrthoVenn2 with a minimum coverage of 70%.